Abstract

This paper performs the analysis necessary to bound the running time of known, 
efficient algorithms for generating all longest common subsequences. That is, we bound 
the running time as a function of input size for algorithms with time essentially propor- 
tional to the output size. This paper considers both the case of computing all distinct 
LCSs and the case of computing all LCS embeddings. Also included is an analysis of 
how much better the efficient algorithms are than the standard method of generating 
LCS embeddings. A full analysis is carried out with running times measured as a func- 
tion of the total number of input characters, and much of the analysis is also provided 
for cases in which the two input sequences are of the same specified length or of two 
independently specified lengths. 
1 Background and Terminologies

Let $A = a_1 a_2 \cdots a_m$ and $B = b_1 b_2 \cdots b_n$ ($m \le n$) be two sequences over an alphabet $\Sigma$. A
sequence that can be obtained by deleting some symbols of another sequence is referred to as 
a subsequence of the original sequence. A common subsequence of A and B is a subsequence 
of both A and B. A longest common subsequence (LCS) is a common subsequence of greatest 
possible length. A pair of sequences may have many different LCSs. In addition, a single 
LCS may have many different embeddings, i.e., positions in the two strings to which the 
characters of the LCS correspond. 
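
The distinction between distinct LCSs and LCS embeddings can be made concrete with a small brute-force sketch. The function names (`lcs_length`, `lcs_embeddings`, `distinct_lcss`) are illustrative and not from the paper, positions are 0-based, and the enumeration is exponential, so it is only meant for very short strings.

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of a longest common subsequence, via the usual row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_embeddings(a, b):
    """All LCS embeddings, each a tuple of (position in a, position in b) pairs."""
    k = lcs_length(a, b)
    found = set()
    for pa in combinations(range(len(a)), k):
        word = [a[i] for i in pa]
        for pb in combinations(range(len(b)), k):
            if word == [b[j] for j in pb]:
                found.add(tuple(zip(pa, pb)))
    return found

def distinct_lcss(a, b):
    """The set of distinct LCS strings, read off from the embeddings."""
    return {"".join(a[i] for i, _ in emb) for emb in lcs_embeddings(a, b)}

# "ab" and "aa" have one distinct LCS ("a") but two embeddings of it.
print(len(distinct_lcss("ab", "aa")), len(lcs_embeddings("ab", "aa")))  # prints: 1 2
```

For `"ab"` and `"aa"` the single LCS `a` can be matched against either `a` of the second string, which is exactly the situation in which the two counts differ.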
Most investigations of the LCS problem have focused on efficiently finding one LCS. A
widely familiar $O(mn)$ dynamic programming approach goes back at least as far as the early
1970s [19, …, 53], and many later studies have focused on improving the time and/or space
required for the computation (e.g., […]).

The familiar dynamic programming approach provides a basis for generating all LCSs, 
but the naive approach (e.g., [1]) may generate the same LCS or even the same LCS embedding
many times. Other methods have been developed to generate a listing of all distinct
LCSs or all LCS embeddings in time proportional to the output size (plus a preprocessing
time of $O(mn)$ or less), i.e., without generating duplicates […]. But prior works
give no indication of how long the running time may be as a function of input size. They 
also do not indicate how the asymptotic time of the efficient methods compares to the naive 
approach, so it is unclear how worthwhile it is to implement the more complex algorithms. 

Section 2 of this paper obtains bounds on the amount of time that may be required to
find all distinct LCSs of two input sequences of fixed total length when using an algorithm 
with time proportional to the output size. Technically, the time is governed by the LCS 
length times the number of LCSs, but we focus on bounding just the maximum possible 
number of distinct LCSs. (There will be little difference in the LCS lengths that maximize 
these two measures.) 

Section 3 similarly bounds the amount of time that may be required to find all LCS embeddings.
Here an exact computation of the maximum possible number of LCS embeddings
is provided, and the analysis is carried out both for a fixed total number of input charac- 
ters and for two input sequences of the same fixed length. In addition, a partial analysis is 
provided for two input sequences with independently specified lengths. Since the maximum 
number of LCS embeddings is achievable when there is just one distinct LCS, the results in 
this section also give a measure of how much more efficient it is to generate all distinct LCSs 
in time proportional to the output size, as compared to a method that efficiently generates 
all embeddings and removes duplicate LCSs. 

Section 4 indicates how much more efficient a fast algorithm for generating all LCS
embeddings (or all distinct LCSs) may be in comparison to the standard method that may 
even report the same embedding more than once. It turns out that the naive algorithm 
may even generate the same embedding of a single LCS exponentially many times, and we 
precisely quantify the asymptotic worst-case overhead. 



2 Bounding the Number of Distinct LCSs 

In this section we determine how much time an efficient algorithm for listing all distinct
LCSs may require. While the actual time is the LCS length $\ell$ times the number of distinct
LCSs (plus preprocessing time), we focus here on bounding just the maximum number of
distinct LCSs. The values of $\ell$ that maximize these two measures will be increasingly close
as the sizes of the input sequences (and consequently $\ell$) grow. Throughout this section, we
will let $D(t)$ denote the maximum possible number of distinct LCSs for two input sequences
of total length $t$ (assuming an unbounded alphabet).

Letting $m = \lfloor t/2 \rfloor$ and $z = (-\lfloor t/2 \rfloor) \bmod 3$, a lower bound on $D(t)$ follows from considering
two input sequences of length $m$ of the form $Xefghijklm\ldots$ and $Ygfejihmlk\ldots$, where
$X$ and $Y$ are empty if $z = 0$, $ab$ and $ba$ if $z = 1$, or $abcd$ and $badc$ if $z = 2$:

Theorem 1 For $t \ge 4$, $D(t) \ge 3^{(\lfloor t/2 \rfloor - 2((-\lfloor t/2 \rfloor) \bmod 3))/3}\, 2^{(-\lfloor t/2 \rfloor) \bmod 3}$ (which implies that for
$t$ divisible by 6, $D(t) \ge 3^{t/6} > 1.2^t$). ■
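
As a sanity check on the lower-bound construction, the following sketch builds the two strings for a given $t$ and counts their distinct LCSs by brute force. The helper names are hypothetical, the blocks after the optional prefix are taken to be consecutive triples of fresh letters starting at `e` (one reversed, one not), and the check is only feasible for small $t$.

```python
from itertools import combinations

def lcs_length(a, b):
    """Standard row-by-row DP for the LCS length."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def distinct_lcss(a, b):
    """Distinct LCSs: common subsequences whose length equals the LCS length."""
    k = lcs_length(a, b)
    return ({"".join(c) for c in combinations(a, k)}
            & {"".join(c) for c in combinations(b, k)})

def theorem1_strings(t):
    """Two strings of length floor(t/2) following the construction above."""
    m = t // 2
    z = (-m) % 3
    x, y = [("", ""), ("ab", "ba"), ("abcd", "badc")][z]
    for k in range((m - 2 * z) // 3):
        block = "".join(chr(ord("e") + 3 * k + i) for i in range(3))
        x += block          # e.g. "efg"
        y += block[::-1]    # e.g. "gfe"
    return x, y

def theorem1_bound(t):
    """3^((m - 2z)/3) * 2^z with m = floor(t/2), z = (-m) mod 3."""
    m = t // 2
    z = (-m) % 3
    return 3 ** ((m - 2 * z) // 3) * 2 ** z

for t in (4, 6, 10, 12):
    a, b = theorem1_strings(t)
    print(t, a, b, len(distinct_lcss(a, b)), theorem1_bound(t))
```

Each reversed triple contributes a factor of 3 (any one of its letters may be chosen), the prefix `ab`/`ba` contributes 2, and `abcd`/`badc` contributes 4, which is how the product in Theorem 1 arises.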

To obtain an upper bound, we begin with the following lemma. For this purpose, we 
define an embedding of a character of an LCS in the two input sequences A and B as an 
ordered pair of a position in A and a position in B from which the character may be selected 
when forming the LCS. We say that two character embeddings $(p, q)$ and $(p', q')$ cross if
$p < p' \wedge q' < q$ or $p' < p \wedge q < q'$.

Lemma 2 Consider two LCSs starting with different characters and any embedding of these 
two LCSs. The embeddings of the initial characters of these two LCSs must cross. 

Proof. Suppose there are embeddings of two LCSs $C = c_1 c_2 \ldots c_\ell$ and $C' = c'_1 c'_2 \ldots c'_\ell$ such
that $c_1 \ne c'_1$ and the embeddings of $c_1$ and $c'_1$ do not cross. Then $c_1 C'$ or $c'_1 C$ is a common
subsequence, contradicting the assumption that $C$ and $C'$ are LCSs. ■

Theorem 3 $D(t) \le 4^{t/5} < 1.32^t$.

Proof. The proof is by induction on t; the base case is easy to check. For the induction step, 
let k be the number of choices for the first character when constructing an LCS from the two 
given input strings. Since the embeddings of all these possible initial characters must cross, 
the sum of the two string positions corresponding to each such embedding must be at least 
k + 1. Furthermore, once such an embedding is chosen for the first character of the LCS, k+1 
characters of the input strings are removed from consideration for construction of the rest of 
the LCS, since no other character of the LCS can have an embedding that crosses the first 
one. Thus, D(t) < kD(t-(k + l)). By the induction hypothesis, D(t) < M^"^ +1 ^/ 5 . Then 
the result follows from the observation that k/A^ k+1 ^ 5 is decreasing for k > 5/ In 4 w 3.6 and 
is at most 1 for integral k < 4. ■ 

Neither Theorem 1 nor Theorem 3 is tight; e.g., with $t = 10$, there are 7 distinct LCSs
for the input strings $abcda$ and $cbadc$. If neither input string contains repeated characters,
however, we can combine Theorem 1 with the following theorem to obtain tight upper and
lower bounds:
Theorem 4 $D(t) \le 3^{(\lfloor t/2 \rfloor - 2((-\lfloor t/2 \rfloor) \bmod 3))/3}\, 2^{(-\lfloor t/2 \rfloor) \bmod 3}$ if there are no repeated characters
in either input sequence.

Proof. We proceed by induction as in Theorem 3, but now when we make one of $k$ choices
for the first character of the LCS, we eliminate $2k$ characters from possible use in the rest
of the LCS. Thus, $D(t) \le k\,D(t - 2k)$. Using the induction hypothesis and considering the
different cases for the values of $\lfloor t/2 \rfloor$ and $k$ modulo 3, the result follows as long as we can
show that $k \cdot 3^{(-k+2((-k) \bmod 3))/3}\, 2^{-((-k) \bmod 3)} \le 1$. We succeed by checking $k = 1, 2, 3, 4,$ and
5 and noting that $k/3^{k/3}$ is decreasing for $k > 3/\ln 3 \approx 2.7$. ■
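
The two elementary inequalities that close the induction steps of Theorems 3 and 4 can be checked exactly with integer and rational arithmetic. This is a minimal sketch; the function names are ours, and the first check uses the equivalence of $k/4^{(k+1)/5} \le 1$ with $k^5 \le 4^{k+1}$.

```python
from fractions import Fraction

def theorem3_step_holds(k):
    """k / 4^((k+1)/5) <= 1, rewritten as the integer inequality k^5 <= 4^(k+1)."""
    return k ** 5 <= 4 ** (k + 1)

def theorem4_factor(k):
    """k * 3^((-k + 2((-k) mod 3))/3) * 2^(-((-k) mod 3)) as an exact fraction."""
    z = (-k) % 3
    e3, rem = divmod(-k + 2 * z, 3)   # the exponent of 3 is always an integer
    assert rem == 0
    return k * Fraction(3) ** e3 * Fraction(1, 2 ** z)

for k in range(1, 7):
    print(k, theorem3_step_holds(k), theorem4_factor(k))
```

The cases $k = 4$ (Theorem 3) and $k = 2, 3, 4$ (Theorem 4) hold with equality, which is where the constants $4^{1/5}$ and the $3$/$2$ mixture in the bounds come from.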

Note that the results in this section also apply in the case that we require m = n, by 
setting t = 2n. 



3 The Maximum Number of LCS Embeddings

In this section, we determine how much time an efficient algorithm for listing all LCS em- 
beddings may require. Utilizing the same justification as in the previous section, we neglect 
LCS length as a component of the running time. (It would actually be easy to incorporate 
LCS length into the presentation in this section, and this may be done in the full paper.) 
Thus we focus on computing the maximum possible number of LCS embeddings. In the full 
paper, we will argue that the maximum number of LCS embeddings can be achieved when 
there is just one distinct LCS. Therefore, we turn our attention to computing the maximum 
possible number of embeddings of a single LCS. This result will also indicate how much more 
efficient it is to generate all distinct LCSs in time proportional to output size
rather than to generate all embeddings (efficiently) and remove duplicates. 

We begin by determining the maximum number of embeddings of an LCS of length $\ell$ in
two input strings of length $m$ and $n$. Then we perform the maximization over $\ell$ in the cases
of (1) $m = n$ and (2) $m$ and $n$ variable but with $m + n$ fixed at $t$.

Lemma 5 The maximum possible number of embeddings $E(n, m, \ell)$ of a single LCS of length
$\ell$ in two input sequences of lengths $m$ and $n$ is

$$E(n, m, \ell) = \max_{y \le \ell} \binom{m-y}{\ell-y}\binom{n+y-\ell}{y}.$$

Proof. First, $E(n,m,\ell) \ge \max_{y \le \ell} \binom{m-y}{\ell-y}\binom{n+y-\ell}{y}$, because, for any $y \le \ell$, we can find
$\binom{m-y}{\ell-y}\binom{n+y-\ell}{y}$ embeddings of the string $a^{\ell-y} b^{y}$ in the two strings $a^{m-y} b^{y}$ and $a^{\ell-y} b^{n+y-\ell}$
(where $x^n$ represents $n$ repetitions of the character $x$).

Now we prove $E(n,m,\ell) \le \max_{y \le \ell} \binom{m-y}{\ell-y}\binom{n+y-\ell}{y}$ as follows. Each character of any LCS
must have a fixed embedding in at least one of the two input strings $A = a_1 a_2 \ldots a_m$ and
$B = b_1 b_2 \ldots b_n$. (Suppose, to the contrary, that $c_k$ of the LCS $C = c_1 c_2 \cdots c_\ell$ could be
embedded into $a_i$ or $a_j$ ($i < j$) and into $b_p$ or $b_q$ ($p < q$). Then $c_1 c_2 \ldots c_{k-1}$ could be
embedded in $a_1 a_2 \cdots a_{i-1}$ and in $b_1 b_2 \cdots b_{p-1}$, while $c_{k+1} c_{k+2} \cdots c_\ell$ could be embedded in
$a_{j+1} a_{j+2} \cdots a_m$ and in $b_{q+1} b_{q+2} \cdots b_n$. This contradicts the supposition that $C$ is an LCS,
because we now know that $c_1 c_2 \cdots c_{k-1} c_k c_k c_{k+1} c_{k+2} \ldots c_\ell$ is a common subsequence of $A$ and
$B$.) Let $y$ be the number of characters of the LCS under consideration that have a fixed
embedding in $A$. Then at least $\ell - y$ characters have a fixed embedding in $B$. Now the
number of ways to embed those $\ell - y$ characters in $A$ is at most $\binom{m-y}{\ell-y}$, and the number of
ways to embed into $B$ the $y$ characters fixed in $A$ is at most $\binom{n+y-\ell}{y}$. ■
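
The lower-bound half of the proof can be checked mechanically. The sketch below assumes the construction reads $a^{m-y}b^{y}$ and $a^{\ell-y}b^{n+y-\ell}$ with LCS $a^{\ell-y}b^{y}$; it counts embeddings as (ways to place the LCS in the first string) times (ways to place it in the second) and compares the result with the binomial product. All function names are illustrative.

```python
from math import comb

def count_occurrences(text, word):
    """Number of ways to choose positions of `text` that spell `word` in order."""
    ways = [1] + [0] * len(word)
    for ch in text:
        # Go right to left so each character of `text` is used at most once.
        for k in range(len(word), 0, -1):
            if word[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[-1]

def lemma5_strings(n, m, l, y):
    """The two input strings and their LCS, for 0 <= y <= l <= m <= n."""
    a = "a" * (m - y) + "b" * y
    b = "a" * (l - y) + "b" * (n + y - l)
    lcs = "a" * (l - y) + "b" * y
    return a, b, lcs

def embeddings_of(lcs, a, b):
    """Embeddings of one fixed common subsequence into the pair (a, b)."""
    return count_occurrences(a, lcs) * count_occurrences(b, lcs)

a, b, lcs = lemma5_strings(n=3, m=3, l=2, y=1)
print(a, b, lcs, embeddings_of(lcs, a, b), comb(2, 1) * comb(2, 1))
# prints: aab abb ab 4 4
```

In this family the `a` characters of the LCS are pinned in the second string and free in the first, while the `b` characters are pinned in the first and free in the second, matching the "fixed embedding in at least one string" argument of the upper bound.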

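Lemma 6 The maximum possible number of embeddings $E(n, m, \ell)$ of a single LCS of length
$\ell$ in two input sequences of lengths $m$ and $n$ is

$$E(n, m, \ell) = \binom{m-y^*}{\ell-y^*}\binom{n+y^*-\ell}{y^*},$$

where

$$y^* = \left\lceil \frac{\ell(n-\ell) + \ell - m}{m + n - 2\ell} \right\rceil.$$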

Proof. The result follows from Lemma 5 as long as we can show that $y^*$ (which satisfies
$y^* \le \ell$) is the (nonnegative integral) value of $y$ that maximizes $P(y) = \binom{m-y}{\ell-y}\binom{n+y-\ell}{y}$. To do
this, we show that $P(y + 1) \le P(y)$ if and only if $y \ge \frac{\ell(n-\ell)+\ell-m}{m+n-2\ell}$ as follows:

$$
\begin{aligned}
P(y+1) \le P(y)
&\iff \frac{(m-y-1)!\,(n+y+1-\ell)!}{(\ell-y-1)!\,(m-\ell)!\,(y+1)!\,(n-\ell)!} \le \frac{(m-y)!\,(n+y-\ell)!}{(\ell-y)!\,(m-\ell)!\,y!\,(n-\ell)!} \\
&\iff (\ell-y)(n+y-\ell+1) \le (m-y)(y+1) \\
&\iff -y^2 + y(\ell-n+\ell-1) + \ell n - \ell^2 + \ell \le -y^2 + y(m-1) + m \\
&\iff \ell(n-\ell) + \ell - m \le y(m+n-2\ell)
\end{aligned}
$$

(Note that $m + n - 2\ell \ge 0$, and where it equals 0 ($n = m = \ell$), the result remains correct as
long as we use the convention of interpreting $\frac{0}{0}$ as 1.) ■
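
A quick numerical check of the maximizer, assuming $P(y) = \binom{m-y}{\ell-y}\binom{n+y-\ell}{y}$ and $y^* = \lceil (\ell(n-\ell)+\ell-m)/(m+n-2\ell) \rceil$ as read from the derivation above, and restricting to $\ell \le m \le n$ with $m + n - 2\ell > 0$. The names `P` and `y_star` are ours.

```python
from math import comb

def P(n, m, l, y):
    """The product maximized in Lemma 5, for 0 <= y <= l <= m <= n."""
    return comb(m - y, l - y) * comb(n + y - l, y)

def y_star(n, m, l):
    """Ceiling of (l(n - l) + l - m) / (m + n - 2l), assuming the denominator is positive."""
    num = l * (n - l) + l - m
    den = m + n - 2 * l
    return -((-num) // den)   # integer ceiling division

n, m, l = 5, 2, 2
print([P(n, m, l, y) for y in range(l + 1)], y_star(n, m, l))  # prints: [1, 4, 10] 2
```

Because the ratio test in the proof shows $P$ is unimodal in $y$, it is enough to compare $P(y^*)$ against the maximum over all $y \in \{0, \ldots, \ell\}$, which is what the check does.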

Next we specialize to m = n. 

Lemma 7 The maximum possible number of embeddings $E(n, n, \ell)$ of a single LCS of length
$\ell$ in two input sequences of length $n$ is

$$E(n, n, \ell) = \binom{n - \lfloor \ell/2 \rfloor}{\lceil \ell/2 \rceil}\binom{n - \lceil \ell/2 \rceil}{\lfloor \ell/2 \rfloor}.$$

Proof. Substituting $n$ for $m$ in Lemma 6 gives $y^* = \lceil (\ell - 1)/2 \rceil = \lfloor \ell/2 \rfloor$. Then substituting
for $m$ and $y^*$ in the expression for $E(n, m, \ell)$ there gives the desired result after using basic
facts about floors and ceilings and the relationship $\binom{r}{k} = \frac{r!}{k!\,(r-k)!}$. ■

Lemma 8 Let $\sigma = (5n - 1 - \sqrt{5(n+1)^2 - 4})/5$ and $\tau = (5n - \sqrt{5(n+1)^2})/5$. Then, the
maximum possible number of embeddings of a single LCS in two input sequences of length $n$
is achieved with an LCS length $\ell^*$ that satisfies the following conditions.

1. $\ell^*$ can be chosen as either $\sigma$ or $\sigma + 1$ if $\sigma$ is integral.

2. Otherwise, $\ell^* = \lceil \sigma \rceil$ if $\lceil \sigma \rceil$ is even.

3. Otherwise, $\ell^* = \lceil \tau \rceil$ (which most often equals $\lceil \sigma \rceil$).

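Proof. From Lemma 7, we see that the condition $E(n,n,\ell+1) \le E(n,n,\ell)$ is equivalent to:

$$(n-\ell)^2 \le \left(\tfrac{\ell}{2}+1\right)\left(n-\tfrac{\ell}{2}\right) \text{ for } \ell \text{ even}
\quad\text{and}\quad
(n-\ell)^2 \le \tfrac{\ell+1}{2}\left(n-\tfrac{\ell-1}{2}\right) \text{ for } \ell \text{ odd}.$$

Now we see that the condition $E(n, n, \ell + 1) \le E(n, n, \ell)$ is equivalent to:

$$5\ell^2 - 2(5n - 1)\ell + 4n(n - 1) \le 0 \quad \text{for } \ell \text{ even} \qquad (1)$$

$$5\ell^2 - 10n\ell + 4n^2 - 2n - 1 \le 0 \quad \text{for } \ell \text{ odd} \qquad (2)$$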

For each of these two quadratic expressions in $\ell$, the roots $r_1$ and $r_2$ satisfy $-1 < r_1 < n$
and $n < r_2$. Since $\ell \le n$, in each of the cases of $\ell$ even and $\ell$ odd, an (integral) value of $\ell$
maximizing $E(n,n,\ell)$ is $\lceil r_1 \rceil$ as long as that value is of the appropriate parity. The values
of $r_1$ are $\sigma$ and $\tau$, in (1) and (2), respectively, and it may be noted that $\tau$ is never integral,
since $\sqrt{5}$ is irrational.

Now, if $\sigma$ is an even integer, we see that according to (1), $\ell^*$ is selectable as $\sigma$ or $\sigma + 1$.
Furthermore, this is the final result, since the odd value $\lceil \tau \rceil$ that maximizes $E(n,n,\ell)$
according to (2) is equal to $\sigma + 1$.

It is easy to check that any time $\sigma$ is integral, it is even, so it only remains to consider
$\sigma$ nonintegral. We see that if $\lceil \sigma \rceil$ is even, then according to (1), $\ell^* = \lceil \sigma \rceil$. Furthermore, (2)
will not lead to a better value of $E(n, n, \ell)$, since $\sigma < \tau < \sigma + 1$. Finally, if $\lceil \sigma \rceil$ is odd, then
either $\lceil \tau \rceil$ has the same odd value, or $\lceil \tau \rceil$ has the even value $\lceil \sigma \rceil + 1$; either way, we see
from (1) and (2) that $\ell^* = \lceil \tau \rceil$. ■
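Theorem 9 The maximum possible number of embeddings of a single LCS in two input
sequences of length $n$ is

$$\binom{\left\lfloor \frac{1}{2}(1 + 1/\sqrt{5})(n+1) \right\rfloor}{\left\lceil \left(5n - 1 - \sqrt{5(n+1)^2 - 4}\right)/10 \right\rceil}
\binom{\left\lfloor \left(5n + 1 + \sqrt{5(n+1)^2 - 4}\right)/10 \right\rfloor}{\left\lfloor \frac{1}{2}(1 - 1/\sqrt{5})(n+1) \right\rfloor}.$$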



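Proof. The result follows from Lemma 8 as follows. Since $\sigma < \tau < \sigma + 1$, we know
$\lfloor \lceil \tau \rceil / 2 \rfloor = \lfloor \lceil \sigma \rceil / 2 \rfloor$ if $\lceil \sigma \rceil$ is even. Thus, by Lemma 8, $\lfloor \ell^*/2 \rfloor = \lfloor \lceil \tau \rceil / 2 \rfloor$. Similarly, there is
an acceptable $\ell^*$ with $\lceil \ell^*/2 \rceil = \lceil \lceil \sigma \rceil / 2 \rceil$. Now we have the following four relationships:

• $\lceil \ell^*/2 \rceil = \lceil \lceil \sigma \rceil / 2 \rceil = \lceil \sigma/2 \rceil$

• $n - \lceil \ell^*/2 \rceil = n - \lceil \sigma/2 \rceil = \lfloor n - \sigma/2 \rfloor$

• $\lfloor \ell^*/2 \rfloor = \lfloor \lceil \tau \rceil / 2 \rfloor = \lfloor \lfloor \tau + 1 \rfloor / 2 \rfloor = \lfloor (\tau + 1)/2 \rfloor$

• $n - \lfloor \ell^*/2 \rfloor = n - \lfloor (\tau + 1)/2 \rfloor = n - \lceil (\tau - 1)/2 \rceil = \lfloor n - (\tau - 1)/2 \rfloor$

Substituting these four relationships into $E(n, n, \ell) = \binom{n - \lfloor \ell/2 \rfloor}{\lceil \ell/2 \rceil}\binom{n - \lceil \ell/2 \rceil}{\lfloor \ell/2 \rfloor}$ from Lemma 7 yields
the desired result. ■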
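
The closed form can be compared against a direct maximization over $\ell$. The sketch below assumes the Lemma 7 expression $\binom{n-\lfloor \ell/2\rfloor}{\lceil \ell/2\rceil}\binom{n-\lceil \ell/2\rceil}{\lfloor \ell/2\rfloor}$ and evaluates the floors and ceilings of Theorem 9 with exact integer square roots rather than floating point; the function names are illustrative.

```python
from math import comb, isqrt

def lemma7(n, l):
    """Embeddings of a single LCS of length l in two strings of length n."""
    return comb(n - l // 2, (l + 1) // 2) * comb(n - (l + 1) // 2, l // 2)

def theorem9(n):
    """Closed form for max over l of lemma7(n, l), for n >= 1."""
    s = isqrt(5 * (n + 1) ** 2 - 4)        # floor(sqrt(5(n+1)^2 - 4))
    r = isqrt(5 * (n + 1) ** 2)            # floor(sqrt(5) (n+1)); sqrt(5)(n+1) is irrational
    top1 = (5 * (n + 1) + r) // 10         # floor((1 + 1/sqrt5)(n+1)/2)
    bot1 = -((s - (5 * n - 1)) // 10)      # ceil((5n - 1 - sqrt(5(n+1)^2 - 4))/10)
    top2 = (5 * n + 1 + s) // 10           # floor((5n + 1 + sqrt(5(n+1)^2 - 4))/10)
    bot2 = (5 * (n + 1) - r - 1) // 10     # floor((1 - 1/sqrt5)(n+1)/2)
    return comb(top1, bot1) * comb(top2, bot2)

for n in (1, 2, 5, 12, 20):
    print(n, theorem9(n), max(lemma7(n, l) for l in range(n + 1)))
```

For $n = 12$ the quantity $\sigma$ is the integer 6, and both $\ell = 6$ and $\ell = 7$ give 7056 embeddings, illustrating the first case of Lemma 8.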

Corollary 10 The limit as $n$ goes to infinity of the maximum possible number of embeddings
of a single LCS in two input sequences of length $n$ is

$$\frac{\phi\sqrt{5}}{(\phi - 1)\,2\pi} \cdot \frac{\phi^{2n}}{n} \approx .932\,(2.62)^n/n,$$

where $\phi = (1 + \sqrt{5})/2$ (the golden ratio).

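Proof. We will use Stirling's approximation to the factorial:

$$n! = \sqrt{2\pi n}\,(n/e)^n\,(1 + \Theta(1/n)) \qquad (3)$$

[…, p. 111]. Then, the limit as $n$ goes to infinity of the expression in Theorem 9 is:

$$\lim_{n\to\infty} \binom{\frac{1}{2}(1 + 1/\sqrt{5})\,n}{\frac{1}{2}(1 - 1/\sqrt{5})\,n}^{2}
= \lim_{n\to\infty} \binom{\phi n/\sqrt{5}}{(\phi - 1)n/\sqrt{5}}^{2}
= \lim_{n\to\infty} \left(\frac{(\phi n/\sqrt{5})!}{((\phi - 1)n/\sqrt{5})!\,(n/\sqrt{5})!}\right)^{2}
= \frac{\phi\sqrt{5}\,\phi^{2n}}{(\phi - 1)\,2\pi n}$$

by Eqn. (3).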
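
To see how the asymptotic estimate behaves numerically, the sketch below compares the exact maximum (computed by maximizing the Lemma 7 expression over $\ell$, which we assume has the form used here) with the estimate $\frac{\phi\sqrt5}{(\phi-1)2\pi}\,\phi^{2n}/n$. The names are ours, and floating point limits $n$ to a few hundred.

```python
from math import comb, pi, sqrt

PHI = (1 + sqrt(5)) / 2
CONSTANT = PHI * sqrt(5) / ((PHI - 1) * 2 * pi)   # about .932

def max_single_lcs_embeddings(n):
    """Exact maximum over l of the Lemma 7 expression."""
    return max(comb(n - l // 2, (l + 1) // 2) * comb(n - (l + 1) // 2, l // 2)
               for l in range(n + 1))

def corollary10_estimate(n):
    """Asymptotic estimate constant * phi^(2n) / n."""
    return CONSTANT * PHI ** (2 * n) / n

def ratio(n):
    return max_single_lcs_embeddings(n) / corollary10_estimate(n)

for n in (20, 100, 400):
    print(n, ratio(n))
```

The ratio approaches 1 only slowly, with an error on the order of $1/n$, so the estimate is best read as an asymptotic statement rather than a tight bound for small $n$.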



Next, we consider the case in which the total number of characters in the two input
strings is fixed, but the lengths of the individual strings are not. The following Lemma
follows immediately from Lemma 5, using the fact $\binom{r}{k}\binom{s}{j} \le \binom{r+s}{k+j}$.

Lemma 11 The maximum possible number of embeddings of a single LCS of length $\ell$ in two
input sequences of total length $t$ is

$$\binom{t - \ell}{\ell}.$$

Using Lemma 11 and reasoning similar to the proof of Lemma 8 yields the following
theorem.



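Theorem 12 The maximum possible number of embeddings of a single LCS in two input
sequences of total length $t$ is

$$\binom{\left\lfloor \left(5t + 3 + \sqrt{5(t+1)^2 + 4}\right)/10 \right\rfloor}{\left\lceil \left(5t - 3 - \sqrt{5(t+1)^2 + 4}\right)/10 \right\rceil}.$$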



Finally, we can use the same type of reasoning as in Corollary 10 to reach the following
conclusion: 



Corollary 13 The limit as $t$ goes to infinity of the maximum possible number of embeddings
of a single LCS in two input sequences of total length $t$ is
$\phi\sqrt{\sqrt{5}/(2\pi)}\;\phi^{t}/\sqrt{t} \approx .965\,(1.62)^t/\sqrt{t}$.
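
The fixed-total-length case can be checked the same way. The sketch assumes that the single-LCS count for total length $t$ and LCS length $\ell$ is $\binom{t-\ell}{\ell}$ (our reading of Lemma 11, obtained from Lemma 5 and the binomial inequality), and that the binomial in Theorem 12 carries a floor on top and a ceiling on the bottom. Function names are illustrative.

```python
from math import comb, isqrt, pi, sqrt

PHI = (1 + sqrt(5)) / 2

def lemma11(t, l):
    """Assumed form of Lemma 11: binom(t - l, l)."""
    return comb(t - l, l)

def theorem12(t):
    """Closed form for max over l of lemma11(t, l), using exact integer square roots."""
    s = isqrt(5 * (t + 1) ** 2 + 4)          # floor(sqrt(5(t+1)^2 + 4))
    top = (5 * t + 3 + s) // 10              # floor((5t + 3 + sqrt(...))/10)
    bot = -((s - (5 * t - 3)) // 10)         # ceil((5t - 3 - sqrt(...))/10)
    return comb(top, bot)

def corollary13_estimate(t):
    """Asymptotic estimate phi * sqrt(sqrt5 / (2 pi)) * phi^t / sqrt(t)."""
    return PHI * sqrt(sqrt(5) / (2 * pi)) * PHI ** t / sqrt(t)

for t in (4, 7, 20, 800):
    exact = max(lemma11(t, l) for l in range(t // 2 + 1))
    print(t, theorem12(t) == exact, exact / corollary13_estimate(t))
```

Summing $\binom{t-\ell}{\ell}$ over $\ell$ gives a Fibonacci number, which is one way to see why powers of $\phi$ govern the growth here.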



4 The Degree of Inefficiency of Naively Generating all LCSs

The standard "naive" method of computing the length of an LCS is a "bottom-up" dynamic
programming approach based on the following recurrence for the length $L[i, j]$ of an LCS of
$a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$:

$$L[i,j] = \begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
L[i-1,j-1] + 1 & \text{if } i, j > 0 \text{ and } a_i = b_j \\
\max\{L[i-1,j],\ L[i,j-1]\} & \text{otherwise}
\end{cases} \qquad (4)$$

We refer to the $L[i,j]$ value as the rank of $[i,j]$ and we call $[i,j]$ a match if $a_i = b_j$. In
$O(mn)$ time, one may fill an array with all the values of $L[i, j]$ for $0 \le i \le m \wedge 0 \le j \le n$,
and the length $\ell$ of an LCS is read off from $L[m,n]$. The same time bound also suffices to
produce a single LCS by a "backtracing" approach starting from position $[m, n]$ of the array.
At each stage we just step from position $[i,j]$ to a position $[i-1,j-1]$, $[i-1,j]$ or $[i,j-1]$
that is responsible for the setting of $L[i,j]$ as per (4); each match encountered generates a
character of the LCS (in reverse order).

The naive approach to generate all LCS embeddings [1] would be to extend the backtracing
method as follows. (To generate all distinct LCSs, one could generate all embeddings
and remove duplicate LCSs.) At each step, we would consider three possibilities (and continue
recursively); from position $[i,j]$ we could add a character to the LCS and move to
$[i-1, j-1]$ if $[i,j]$ is a match, and we could move to $[i-1,j]$ or $[i,j-1]$ if the rank there
equals $L[i,j]$ (without adding a character to the LCS and regardless of whether $[i,j]$ is a
match). Whenever we reach $[0,0]$, we can print out an LCS embedding. We could make
some simple improvements such as stopping each backtrace path at any position of rank 0,
but this will not change the basic degree of inefficiency as expressed in the theorem below.
(Note that output size is always at least 1 rather than 0, because the empty string $\varepsilon$ is a
common subsequence of any pair of input sequences.)
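
The naive procedure just described can be sketched directly. This is an illustrative implementation under our own naming (`lcs_table`, `naive_embeddings`), using 1-based positions as in the text and plain recursion, so it is only suitable for short inputs; it returns every reported embedding, duplicates included.

```python
from collections import Counter

def lcs_table(a, b):
    """Fill L[i][j] (the rank of [i, j]) according to recurrence (4)."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L

def naive_embeddings(a, b):
    """Every embedding reported by the naive backtrace, in report order."""
    L = lcs_table(a, b)
    reported = []

    def walk(i, j, picked):
        if i == 0 and j == 0:
            reported.append(tuple(reversed(picked)))   # matches were found in reverse
            return
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            walk(i - 1, j - 1, picked + [(i, j)])       # use the match at [i, j]
        if i > 0 and L[i - 1][j] == L[i][j]:
            walk(i - 1, j, picked)                      # same rank above
        if j > 0 and L[i][j - 1] == L[i][j]:
            walk(i, j - 1, picked)                      # same rank to the left
    walk(len(a), len(b), [])
    return reported

# For "ab" and "aa" the embedding ((1, 1),) is reported twice and ((1, 2),) once.
print(Counter(naive_embeddings("ab", "aa")))
```

Even on this two-character example one embedding is reported twice, because position $[1,1]$ is reachable both through $[1,2]$ and through $[2,1]$.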

Theorem 14 The naive method of generating all LCS embeddings (or all LCSs) may require
time exceeding the output size by a factor of $\Theta\!\left(\binom{n+m}{m}\right)$ in the worst case.

Proof. For the upper bound, consider the "normalized" time $N[i,j]$, representing the time
to complete the naive backtrace procedure from position $[i,j]$ divided by $\max\{1, L[i, j]\}$.

An induction argument shows that there are positive constants $c$ and $d$ such that $N[m, n] \le
c\binom{n+m}{m} - d$. It is easy to choose constants and obtain $N[i,j] \le c\binom{i+j}{i} - d$ for any $[i,j]$ with
$i \le 1$ or $j \le 1$. Included in this result is that $N[i, j] \le c\binom{i+j}{i} - d$ for $i + j \le 3$. We then
complete the induction by showing, for an arbitrary $m, n \ge 2$, that $N[m,n] \le c\binom{n+m}{m} - d$
given that $N[i, j] \le c\binom{i+j}{i} - d$ for $i + j < n + m$. For this final step, we perform the following
case analysis, with $\ell$ denoting the rank of $[m, n]$.

Case I: $[m, n]$ is not a match. Then $N[m,n] \le N[m-1,n] + N[m, n-1] + O(1)$. (It
is easy to see that this relationship holds if $N$ represents ordinary time, since the traceback
from $[m, n]$ does not need to add anything to the outputs generated in the tracebacks from
$[m-1,n]$ and $[m, n-1]$. The relationship then holds for normalized time, since the ranks
of $[m-1,n]$ and $[m,n-1]$ can be no higher than $\ell$.) The induction step can be completed
by invoking the induction hypothesis and using the familiar identity

$$\binom{n+m}{m} = \binom{n+m-1}{m-1} + \binom{n+m-1}{m}. \qquad (5)$$

Case II: [m, n] is a match. The following three subcases cover all possibilities (albeit with 
overlap between cases IIA and IIB). 

Case IIA: $[m-1, n]$ is not of rank $\ell$. Then, we have $N[m, n] \le N[m-1, n-1] + N[m, n-1] + O(1)$.
(Here, this relationship would not be valid with $N$ representing ordinary time,
because every output produced in the traceback from $[m-1, n-1]$ must be augmented with
an additional character corresponding to the match at position $[m, n]$. But with normalized
time, the relationship can be justified as follows. $[m-1,n-1]$ is of rank $\ell - 1$, so
$(\ell - 1)N[m-1, n-1]$ is an upper bound on the amount of time spent in the backtrace from
$[m-1, n-1]$. Furthermore, $N[m-1, n-1]$ is an upper bound on the number of outputs
produced in the backtrace from $[m-1,n-1]$ and therefore on the amount of extra time
appending a single extra character to each output. Since $[m, n-1]$ is also of rank at most $\ell$
and we need not add anything to the outputs of the backtrace from $[m, n-1]$, the desired
relationship holds.) We can now complete the induction step in a similar fashion to Case I.

Case IIB: $[m, n-1]$ is not of rank $\ell$. This case is completely analogous to Case IIA.

Case IIC: $[m-1, n]$ and $[m, n-1]$ are both of rank $\ell$. Then

$$N[m, n] \le N[m-1,n] + N[m-1, n-1] + N[m, n-1] + O(1). \qquad (6)$$

Furthermore, since $[m, n]$ is a match, $[m-1, n-1]$ is at rank $\ell - 1$, which is lower than the
ranks of $[m-1, n]$ and $[m, n-1]$. Thus,

$$N[m-1, n] \le N[m-2, n] + N[m-2, n-1] + O(1) \qquad (7)$$

$$N[m, n-1] \le N[m, n-2] + N[m-1, n-2] + O(1) \qquad (8)$$

Combining Equations 6, 7, and 8, we have

$$N[m,n] \le N[m-2,n] + N[m-2,n-1] + N[m-1,n-1] + N[m-1,n-2] + N[m,n-2] + O(1).$$

Now we are again able to complete the induction step as in Case I, using Equation 5 several
times.

From the case analysis, we have concluded that the normalized time $N[m, n]$ is $O\!\left(\binom{n+m}{m}\right)$.
Since the true output size of listing even just distinct LCSs is at least $\ell = L[m,n]$, the
overhead of the naive algorithm is $O\!\left(\binom{n+m}{m}\right)$.

For the lower bound, note that an overhead of $\Theta\!\left(\binom{n+m}{m}\right)$ is achievable by simply choosing
sequences with no matches. Furthermore, even if we make the backtracing procedure
less naive by printing outputs whenever we hit a node of rank 0, we still would spend
$\Omega\!\left(\binom{n+m-2}{m-1}\right) = \Omega\!\left(\binom{n+m}{m}\right)$ time for a pair of input strings in which the only match is at $[1, 1]$,
while the true output size would be 1 even to list all embeddings of all LCSs. ■
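
The lower-bound example can be reproduced without enumerating outputs: the number of times the naive backtrace reaches $[0,0]$ equals the number of admissible paths from $[m,n]$, which a memoized count gives directly. This is a sketch with hypothetical names, following the three moves described for the naive method.

```python
from functools import lru_cache
from math import comb

def naive_report_count(a, b):
    """How many times the naive backtrace reaches [0, 0], duplicates included."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    @lru_cache(maxsize=None)
    def paths(i, j):
        if i == 0 and j == 0:
            return 1
        total = 0
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            total += paths(i - 1, j - 1)      # take the match
        if i > 0 and L[i - 1][j] == L[i][j]:
            total += paths(i - 1, j)          # equal rank above
        if j > 0 and L[i][j - 1] == L[i][j]:
            total += paths(i, j - 1)          # equal rank to the left
        return total

    return paths(m, n)

# With no matches every position has rank 0, so all lattice paths are followed.
print(naive_report_count("abc", "defg"), comb(7, 3))  # prints: 35 35
```

With no matches the only LCS is the empty string with a single (empty) embedding, yet it is reported $\binom{n+m}{m}$ times, which is the overhead quantified in the theorem.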

We can now give a simple expression for the worst-case overhead of the naive algorithm on
two input strings of equal length. This result follows from expressing $\binom{2n}{n}$ from Theorem 14
as $\frac{(2n)!}{(n!)^2}$ and using Equation 3.

Corollary 15 For two input strings of length $n$, naively generating all LCS embeddings (or
all LCSs) may require time exceeding the output size by a factor of $\Theta(4^n/\sqrt{n})$ in the worst
case. ■
Finally, we can recast the result for the case in which the total number of characters in
the two input strings is fixed, but the lengths of the individual strings are not. 

Corollary 16 For two input strings of total length t, naively generating all LCS embeddings 
(or all LCSs) may require time exceeding the output size by a factor of $\Theta(2^t/\sqrt{t})$ in the worst
case. 

Proof. From Theorem 14, the worst-case overhead is based on the maximum value of $\binom{t}{m}$,
which is $\binom{t}{\lfloor t/2 \rfloor}$. Then we proceed as for Corollary 15. ■
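
The growth rates in the last two corollaries rest on the standard asymptotics of central binomial coefficients, $\binom{2n}{n} \sim 4^n/\sqrt{\pi n}$ and $\binom{t}{\lfloor t/2 \rfloor} \sim 2^t\sqrt{2/(\pi t)}$. A short numerical sketch (function names are ours; exact rationals are used so large values do not overflow before scaling):

```python
from fractions import Fraction
from math import comb, pi, sqrt

def overhead_equal_lengths(n):
    """binom(2n, n) * sqrt(n) / 4^n, which tends to 1/sqrt(pi)."""
    return float(Fraction(comb(2 * n, n), 4 ** n)) * sqrt(n)

def overhead_total_length(t):
    """binom(t, floor(t/2)) * sqrt(t) / 2^t, which tends to sqrt(2/pi)."""
    return float(Fraction(comb(t, t // 2), 2 ** t)) * sqrt(t)

print(overhead_equal_lengths(500), 1 / sqrt(pi))
print(overhead_total_length(1001), sqrt(2 / pi))
```

Both scaled quantities settle near fixed constants, consistent with overheads of order $4^n/\sqrt{n}$ and $2^t/\sqrt{t}$ respectively.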



5 Conclusion 

We have seen that the maximum number of distinct longest common subsequences for fixed 
input length is much less than the maximum number of LCS embeddings, which is much 
less than the maximum number of embeddings (including duplicates) obtained by generating 
embeddings by the standard method. Thus, it is much more efficient to generate all distinct 
LCSs or all LCS embeddings in time proportional to the output size than to use the standard 
method of generating LCS embeddings. 